Adaptive Caching of Input/Output Data

ABSTRACT

To improve caching techniques, so as to realize greater hit rates within available memory, the present invention utilizes an entropy signature from the compressed data blocks to supply a bias to pre-fetching operations. The method of the present invention for caching data involves detecting a data I/O request, relative to a data object, and then selecting appropriate I/O to cache, wherein said selecting can occur with or without user input, or with or without application or operating system preknowledge. Such selecting may occur dynamically or manually. The method further involves estimating an entropy of a first data block to be cached in response to the data I/O request; selecting a compressor using a value of the entropy of the data block from the estimating step, wherein each compressor corresponds to one of a plurality of ranges of entropy values relative to an entropy watermark; and storing the data block in a cache in compressed form from the selected compressor, or in uncompressed form if the value of the entropy of the data block from the estimating step falls in a first range of entropy values relative to the entropy watermark. The method can also include the step of prefetching a data block using gap prediction with an applied entropy bias, wherein the data block is the same as the first data block to be cached or is a separate second data block. The method can also involve the following additional steps: adaptively adjusting the plurality of ranges of entropy values; scheduling a flush of the data block from the cache; and suppressing operating system flushes in conjunction with the foregoing scheduling step.

RELATED APPLICATION DATA

This application is a continuation of U.S. patent application Ser. No. 11/152,363, filed on Jun. 14, 2005, entitled “Adaptive Input/Output Compressed System and Data Cache and System Using Same”, invented by John E. Kellar, which claims benefit of priority of U.S. provisional application Ser. No. 60/579,344 titled “Adaptive Input/Output Cache and System Using Same,” filed Jun. 14, 2004, and which are all hereby incorporated by reference in their entirety as though fully and completely set forth herein.

FIELD OF THE INVENTION

The present invention relates, in general, to data processing systems and more particularly to adaptive data caching in data processing systems to reduce transfer latency or increase transfer bandwidth of data movement within these systems.

DESCRIPTION OF THE RELATED ART

In modern data processing systems, the continual increase in processor speeds has outpaced the rate of increase of data transfer rates from peripheral persistent data storage devices and sub-systems. In systems such as enterprise scale server systems in which substantial volumes of volatile or persistent data are manipulated, the speed at which data can be transferred may be the limiting factor in system efficiency. Commercial client/server database environments are emblematic of such systems. These environments are usually constructed to accommodate a large number of users performing a large number of sophisticated database queries and operations to a large distributed database. These compute, memory and I/O intensive environments put great demands on database servers. If a database client or server is not properly balanced, then the number of database transactions per second that it can process can drop dramatically. A system is considered balanced for a particular application when the CPU(s) tends to saturate about the same time as the I/O subsystem.

Continual improvements in processor technology have been able to keep pace with ever-increasing performance demands, but the physical limitations imposed on retrieving data from disk have caused I/O transfer rates to become an inevitable bottleneck. Bypassing these physical limitations has been an obstacle to overcome in the quest for better overall system performance.

In the computer industry, this bottleneck, known as a latency gap because of the speed differential, has been addressed in several ways. Caching the data in memory is known to be an effective way to diminish the time taken to access the data from a rotating disk. Unfortunately, memory resources are in high demand on many systems, and traditional cache designs have not made the best use of memory devoted to them. For instance, many conventional caches simply cache data existing ahead of the last host request. Implementations such as these, known as Read Ahead caching, can work in unique situations, but for non-sequential read requests, data is fruitlessly brought into the cache memory. This blunt approach to caching has nevertheless become quite common due to the simplicity of the design. In fact, this approach has been put in use as read buffers within persistent data storage systems such as disks and disk controllers.

Encoding or compressing cached data in operating system caches increases the logical effective cache size and cache hit rate, and thus improves system response time. On the other hand, compressed data requires variable-length record management, free space search and garbage collection. This overhead may negate the performance improvements achieved by increasing effective cache size. Thus, there is a need for a new operating system file, data and buffer cache data managing method with low overhead that remains transparent to operating systems using conventional data managing methods. With such an improved method, it is expected that the effective, logically accessible memory available for file and data buffer cache will increase by 30% to 400%, effectively improving system cost-performance.

Ideally, a client should not notice any substantial degradation in response time for a given transaction even as the number of transactions requested per second by other clients to the database server increases. The availability of main memory plays a critical role in a database server's ability to scale for this application. In general, a database server will continue to scale up until the point that the application data no longer fits in main memory. Beyond this point, the buffer manager resorts to swapping pages between main memory and storage sub-systems. The amount of this paging increases exponentially as a function of the fraction of main memory available, causing application performance and response time to degrade exponentially as well. At this point, the application is said to be I/O bound.

When a user performs a sophisticated data query, thousands of pages may be needed from the database, which is typically distributed across many storage devices, and possibly distributed across many systems. To minimize the overall response time of the query, access times must be as small as possible to any database pages that are referenced more than once. Access time is also negatively impacted by the enormous amount of temporary data that is generated by the database server, which normally cannot fit into main memory, such as the temporary files generated for sorting. If the buffer cache is not large enough, then many of those pages will have to be repeatedly fetched to and from the storage sub-system.

Independent studies have shown that when only 70% to 90% of the working data fits in main memory, most applications will run several times slower. When only 50% fits, most run 5 to 20 times slower. Typical relational database operations run 4 to 8 times slower when only 66% of the working data fits in main memory. The need to reduce or eliminate application page faults and data or file system I/O is compelling. Unfortunately for system designers, the demand for more main memory by database applications will continue to far exceed the rate of advances in memory density. Coupled with this demand from the application area come competing demands from the operating system, as well as associated I/O controllers and peripheral devices. Cost-effective methods are needed to increase the apparent, effective size of system memory.

It is difficult for I/O bound applications to take advantage of recent advances in CPU, processor cache, Front Side Bus (FSB) speeds, >100 Mbit network controllers, and system memory performance improvements (e.g., DDR2) since they are constrained by the high latency and low bandwidth of volatile or persistent data storage subsystems. The most common way to reduce data transfer latency is to add memory. Adding memory to database servers may be expensive since these applications demand a lot of memory, or may even be impossible due to physical system constraints such as slot limitations. Alternatively, adding more disks and disk caches with associated controllers, or Network Attached Storage (NAS) and network controllers, or even Storage Area Network (SAN) devices with Host Bus Adapters (HBAs), can increase storage sub-system request and data bandwidth. It may even be necessary to move to a larger server with multiple, higher performance I/O buses. Memory and disks are added until the database server becomes balanced.

First, the memory data encoding/compression increases the effective size of system-wide file and/or buffer cache by encoding and storing a large block of data into a smaller space. The effective available reach of these caches is typically doubled, where reach is defined as the total immediately accessible data requested by the system, without recourse to out-of-core (not in main memory) storage. This allows client/server applications, which typically work on data sets much larger than main memory, to execute more efficiently due to the decreased number of volatile, or persistent, storage data requests. The number of data requests to the storage sub-systems is reduced because pages or disk blocks that have been accessed before are statistically more likely to still be in main memory when accessed again, due to the increased capacity of cache memory. A secondary effect of such compression or encoding is reduced latency in data movement due to the reduced size of the data. Basically, the average compression ratio must be balanced against the original data block size as well as the internal cache hash bucket size in order to reap the greatest benefit from this tradeoff. The Applicant of the present invention believes that an original uncompressed block size of 4096 bytes with an average compression ratio of 2:1, stored internally in the cache, in a data structure known as an open hash, in blocks of 256 bytes, results in the greatest benefit towards reducing data transfer latency for data movement across the north and south bridge devices as well as to and from the processors across the Front-Side Bus. The cache must be able to modify these values in order to reap the greatest benefits from this second-order effect.
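
Purely for illustration, the following minimal sketch shows the storage layout suggested above: 4,096-byte logical blocks compressed and kept in an open hash as chains of 256-byte chunks. The class and method names are hypothetical, and zlib stands in for the entropy-selected compressors described later.

    import zlib

    BLOCK_SIZE = 4096   # uncompressed logical block size discussed above
    CHUNK_SIZE = 256    # internal cache allocation unit (hash bucket chunk)

    class OpenHashCache:
        """Toy open-hash store: block number -> chain of 256-byte chunks."""
        def __init__(self):
            self._buckets = {}

        def store(self, block_no, data):
            assert len(data) == BLOCK_SIZE
            payload = zlib.compress(data)   # stand-in for the selected compressor
            # split the compressed payload into fixed-size chunks (last may be short)
            self._buckets[block_no] = [payload[i:i + CHUNK_SIZE]
                                       for i in range(0, len(payload), CHUNK_SIZE)]

        def load(self, block_no):
            return zlib.decompress(b"".join(self._buckets[block_no]))

    cache = OpenHashCache()
    cache.store(7, bytes(BLOCK_SIZE))         # a highly redundant block compresses well
    assert cache.load(7) == bytes(BLOCK_SIZE)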

There is a need to improve caching techniques, so as to realize greater hit rates within the available memory of modern systems. Current hit rates, from methods such as LRU (Least Recently Used), LFU (Least Frequently Used), GCLOCK and others, have increased very slowly in the past decade, and many of these techniques do not scale well with the large amounts of memory that modern computer systems have available today. To help meet this need, the present invention utilizes an entropy signature from the compressed data blocks to supply a bias to pre-fetching operations. This signature is produced from the entropy estimation function described herein, and stored in the tag structure of the cache. This signature provides a unique way to group previously seen data; this grouping is then used to bias or alter the pre-fetching gaps produced by the prefetching function described below. Empirical evidence shows that this entropy signature improves pre-fetching operations over large data sets (greater than 4 GBytes of addressable space) by approximately 11% over current techniques that do not have this feature available.

There is also a need for user applications to be able to access the capabilities for reducing transfer latency or increasing transfer bandwidth of data movement within these systems. There is a further need to supply these capabilities to these applications in a transparent way, allowing an end-user application to access these capabilities without requiring any recoding or alteration of the application. The Applicant of the present invention believes this may be accomplished through an in-core file-tracking database maintained by the invention. Such an in-core file-tracking database would offer seamless access to the capabilities of the invention by monitoring file open and close requests from the user-application/operating system interface, decoding the file access flags, maintaining an internal list of the original file object name and flags, and offering the capabilities of the invention to appropriate file accesses. The in-core file-tracking database would also allow the end-user to over-ride an application's caching request and either allow or deny write-through, write-back, non-conservative or no-caching to an application on a file-by-file basis, through the use of manual file tracking, or on a system-wide basis, through the use of dynamic file tracking. This capability could also be offered in a more global, system-wide way by allowing caching of file system metadata; this caching technique (the caching of file system metadata specifically) is referred to throughout this document as “non-conservative caching.”

There is a further need to allow an end-user application to seamlessly access PAE (Physical Address Extension) memory for use in file caching/data buffering, without the need to re-code or modify the application in any way. The PAE memory addressing mode is limited to the Intel x86 architecture. There is a need for a replacement of the underlying memory allocator to allow a PAE memory addressing mode to function on other processor architectures. This would allow end-user applications to utilize modern memory addressing capabilities without the need to re-code or modify the end-user application in any way. This allows transparent, seamless access to PAE memory, for use by the buffer and data cache, without user intervention or system modification.

Today, large numbers of storage sub-systems are added to a server system to satisfy the high I/O request rates generated by client/server applications. As a result, it is common that only a fraction of the storage space on each storage device is utilized. By effectively reducing the I/O request rate, fewer storage sub-system caches and disk spindles are needed to queue the requests, and fewer disk drives are needed to serve these requests. The reason that the storage sub-system space is not efficiently utilized is that, on today's hard-disk storage systems, access latency increases as the data written to the storage sub-system moves further inward from the edge of the magnetic platter; in order to keep access latency at a minimum, system designers over-design storage sub-systems to take advantage of this phenomenon. This results in under-utilization of available storage. There is a need to reduce average latency to the point that this trade-off is not needed, resulting in storage space associated with each disk that can be more fully utilized at an equivalent or reduced latency penalty.

In addition, by reducing the size of data to be transferred between local and remote persistent storage and system memory, the I/O and Front Side Buses (FSB) are utilized less. This reduced bandwidth requirement can be used to scale system performance beyond its original capabilities, or allow the I/O subsystem to be cost-reduced due to reduced component requirements based on the increased effective bandwidth available.

Thus, there is a need in the art for mechanisms to bridge the gap between the increases in CPU clock cycles and data movement latency without the need for adding additional volatile or persistent storage and memory sub-systems or increasing the clock cycle frequency of internal system and I/O buses. Furthermore, there is a need to supply this capability transparently to end user applications so that they can take advantage of this capability in both a dynamic and a directed way.

SUMMARY OF THE INVENTION

There is a need to improve caching techniques, so as to realize greater hit rates within the available memory of modern systems. Current hit rates, from methods such as LRU (Least Recently Used), LFU (Least Frequently Used), GCLOCK and others, have increased very slowly in the past decade, and many of these techniques do not scale well with the large amounts of memory that modern computer systems have available today. To help meet this need, the present invention utilizes an entropy signature from the compressed data blocks to supply a bias to pre-fetching operations. This signature is produced from the entropy estimation function described herein, and stored in the tag structure of the cache. This signature provides a unique way to group previously seen data; this grouping is then used to bias or alter the pre-fetching gaps produced by the prefetching function described below. Empirical evidence shows that this entropy signature improves pre-fetching operations over large data sets (greater than 4 GBytes of addressable space) by approximately 11% over current techniques that do not have this feature available.

The method for caching data in accordance with the present invention involves detecting a data input/output request, relative to a data object, and then selecting appropriate I/O to cache, wherein said selecting can occur with or without user input, or with or without application or operating system preknowledge. Such selecting may occur dynamically or manually. The method of the present invention further involves estimating an entropy of a data block to be cached in response to the data input/output request; selecting a compressor using a value of the entropy of the data block from the estimating step, wherein each compressor corresponds to one of a plurality of ranges of entropy values relative to an entropy watermark; and storing the data block in a cache in compressed form from the selected compressor, or in uncompressed form if the value of the entropy of the data block from the estimating step falls in a first range of entropy values relative to the entropy watermark. The method for caching data in accordance with the present invention can also include the step of prefetching a data block using gap prediction with an applied entropy bias, wherein the data block is the data block to be cached, as referenced above, or is a separate second data block. The method of the present invention can also involve the following additional steps: adaptively adjusting the plurality of ranges of entropy values; scheduling a flush of the data block from the cache; and suppressing operating system flushes in conjunction with the foregoing scheduling step.

The foregoing has outlined rather broadly the features and technical advantages of the present invention in order that the detailed description of the invention that follows may be better understood. Additional features and advantages of the invention will be described hereinafter, which form the subject of the claims of the invention.

BRIEF DESCRIPTION OF THE DRAWINGS

The present invention may be better understood, and its numerous objects, features, and advantages made apparent to those skilled in the art, by referencing the accompanying drawings. The use of the same reference number throughout the several figures designates a like or similar element.

FIG. 1A (prior art) depicts a generalized system architecture of a modern data processing system;

FIG. 1B (prior art) depicts generalized software architecture for the I/O subsystem of Windows 2000, XP, and beyond;

FIG. 2A illustrates a high-level logical view of an adaptive compressed cache architecture in accordance with the present inventive principles;

FIG. 2B illustrates, in more detail, a high-level logical view of an adaptive compressed cache;

FIG. 2C illustrates a logical view of an adaptive compressed caching architecture in accordance with the present inventive principles;

FIG. 2D is a table showing opened file policy for cache in accordance with an embodiment of the present invention;

FIG. 2E illustrates the flags used for file tracking specifications in accordance with an embodiment of the present invention;

FIG. 3 illustrates a cache protocol in a state diagram format view in accordance with present state-of-the-art principles;

FIG. 4A shows a modified MSI cache protocol, wherein the MSI protocol is modified in accordance with the present inventive design principles;

FIG. 4B shows state transitions for write-invalidation in accord with the present inventive design principles;

FIGS. 5 and 6 are flow diagrams illustrating implementation details in accordance with an embodiment of the present invention;

FIGS. 7A-7D are further flow diagrams illustrating implementation details in accordance with an embodiment of the present invention;

FIG. 7E is a schematic representation of a data structure in accordance with an embodiment of the present invention;

FIG. 7F schematically depicts a set of entropy bands having pre-selected relative widths about the maximum-entropy watermark;

FIGS. 7G and 8A-8J are flow diagrams illustrating implementation details in accordance with an embodiment of the present invention; and

FIG. 9 illustrates an exemplary hardware configuration of a data processing system in accordance with the present invention.

DESCRIPTION OF THE PREFERRED EMBODIMENT(S)

In the following description, numerous specific details are set forth, such as specific word or byte lengths, etc., to provide a thorough understanding of the present invention. However, it will be obvious to those skilled in the art that the present invention may be practiced without such specific details. In other instances, well-known circuits have been shown in block diagram form in order not to obscure the present invention in unnecessary detail. For the most part, details concerning timing considerations and the like have been omitted inasmuch as such details are not necessary to obtain a complete understanding of the present invention and are within the skills of persons of ordinary skill in the relevant art.

Refer now to the drawings wherein depicted elements are not necessarily shown to scale and wherein like or similar elements are designated by the same reference numeral through the several views.

FIG. 1A (prior art) depicts a generalized system architecture of a modern data processing system.

FIG. 1B (prior art) depicts generalized software architecture for the I/O subsystem of Windows 2000, XP, and beyond. This diagram is not intended to be literally accurate, but a generalized view of the software components, and how they exist within the system from a hierarchical point of view. This diagram utilizes the Windows operating system only for illustrative purposes, as the present inventive embodiment may be implemented in any modern operating system in fundamentally the same way. Note that this figure illustrates both a file and data cache, as well as a network controller device cache. The present invention may be adapted to either a network controller device or a disk controller device using the same inventive design principles discussed below.

FIG. 2A illustrates a high-level logical view of an adaptive compressed cache architecture in accordance with the present inventive principles.

FIG. 2B illustrates, in more detail, a high-level logical view of an adaptive compressed cache.

FIG. 2C illustrates a logical view of an adaptive compressed caching architecture 100 in accordance with the present inventive principles. Modern data processing systems may be viewed from a logical perspective as a layered structure 102 in which a software application 104 occupies the top level, with the operating system (OS) application program interfaces (APIs) 106 between the application and the OS 108. OS APIs 106 expose system services to the application 104. These may include, for example, file input/output (I/O), network I/O, etc. Hardware devices are abstracted at the lowest level 110. Hardware devices (see FIGS. 2A and 2B) may include the central processing unit (CPU) 112, memory, persistent storage (e.g., disk controller 114), and other peripheral devices 116. In the logical view represented in FIG. 2C, these are handled on an equal footing. That is, each device “looks” the same to the OS.

In accordance with the present inventive principles, filter driver 118 intercepts the operating system file access and performs caching operations, described further herein below, transparently. That is, the caching, file tracking and, in particular, the compression associated therewith, is transparent to the application 104. Data selected for caching is stored in a (compressed) cache (denoted as ZCache 120). (The “ZCache” notation is used as a mnemonic device to call attention to the fact that the cache in accordance with the present invention is distinct from the instruction/data caches commonly employed in modern microprocessor systems, and typically denoted by the nomenclature “L1”, “L2”, etc. cache. Furthermore, the Z is a common mnemonic used to indicate compression or encoding activity.) In an embodiment of the present invention, ZCache 120 may be physically implemented as a region in main memory. Filter 118 maintains a file tracking database (DB) 122 which contains information regarding which files are to be cached or not cached, and other information useful to the management of file I/O operations, as described further herein below. Although logically part of filter driver 118, physically, file tracking DB 122 may be included in ZCache 120.
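
Conceptually (and not as actual kernel driver code), the interception might be sketched as follows: a filter layer sees each file I/O before the file system does, consults the file-tracking table, and serves or populates the ZCache transparently. All class, method and file names here are illustrative assumptions.

    class FileTrackingDB:
        def __init__(self):
            self.policies = {}   # file name -> "writeback" | "writethrough" | "nocache"

        def policy_for(self, name):
            return self.policies.get(name, "nocache")

    class FilterDriver:
        def __init__(self, zcache, tracking, lower_layer):
            self.zcache = zcache          # compressed cache region in main memory
            self.tracking = tracking      # file tracking database
            self.lower = lower_layer      # file system / disk stack below the filter

        def read(self, name, block_no):
            key = (name, block_no)
            cached = self.tracking.policy_for(name) != "nocache"
            if cached and key in self.zcache:
                return self.zcache[key]   # transparent cache hit
            data = self.lower(name, block_no)   # fall through to the real stack
            if cached:
                self.zcache[key] = data
            return data

    disk = lambda name, block_no: bytes(16)    # fake backing store for the example
    db = FileTrackingDB(); db.policies["journal.dat"] = "writeback"
    drv = FilterDriver({}, db, disk)
    drv.read("journal.dat", 0)   # miss: fetched from below, then cached
    drv.read("journal.dat", 0)   # hit: served from the ZCache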

A few notes on FIG. 2C:

1) The preferred embodiment of the file system filter driver layers itself between boxes #2 (I/O Manager Library) and #18 (FS Driver).

2) The disk filter layers itself between boxes #18 (FS Driver) and the boxes in the peer group depicted by #19 (Disk Class), #20 (CD-ROM Class), and #21 (Class).

3) The ZCache module exists as a stand-alone device driver adjunct to the file system filter and disk filter device drivers.

4) A TDI Filter Driver is inserted between box (TDI) 8, with connection tracking for network connections that operates the same as the file tracking modules in the compressed data cache, and the peer group of modules that consist of (AFD) 3, (SRV) 4, (RDR) 5, (NPFS) 6, and (MSFS) 7. A complete reference on TDI is available on the Microsoft MSDN website at

http://msdn.microsoft.com/library/default.asp?url=/library/en-us/network/hh/network/303tdi.sub.--519j.asp, which is incorporated herein by reference.

5) An NDIS intermediate cache driver is inserted between the bottom edge of the transport drivers and the upper edge of the NDIS components.

FIG. 3 illustrates a cache protocol in a state diagram format view, in accordance with present state-of-the-art principles. This state diagram describes the Modified-Shared-Invalid (MSI) cache protocol. This cache protocol is one used on processor caches, and is closest to what is needed for a block-based cache. Other possible cache protocols which are not precluded by this preferred embodiment include MESI, MOESI, Dragon and others.

The definitions of the states shown in FIG. 3 are as follows (a brief illustrative sketch of these states appears after the list):

1) Invalid: The cache line does not contain valid data.

2) Shared: The cache line contains data, which is consistent with the backing store in the next level of the memory hierarchy.

3) Modified: The cache line contains the most recent data, and is different than data contained in backing store.
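
For orientation only, here is a small illustrative sketch of the three standard MSI states and a few canonical transitions. This is textbook MSI, not the modified protocol of FIG. 4A; the function and event names are assumptions.

    from enum import Enum

    class MSI(Enum):
        INVALID = "invalid"     # cache line holds no valid data
        SHARED = "shared"       # consistent with the backing store
        MODIFIED = "modified"   # newer than the backing store ("dirty")

    def on_event(state, event):
        if event == "read":
            return MSI.SHARED if state is MSI.INVALID else state
        if event == "write":
            return MSI.MODIFIED
        if event == "invalidate":
            return MSI.INVALID
        raise ValueError(event)

    assert on_event(MSI.INVALID, "read") is MSI.SHARED
    assert on_event(MSI.SHARED, "write") is MSI.MODIFIED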

FIG. 4A shows the modified MSI cache protocol. In accordance with the present inventive design principles, the MSI protocol must be modified, as in FIG. 4A, to accomplish the present inventive design goals. Many factors are considered in the development of caching protocols, and most of the above-mentioned cache protocols are of a general purpose only, or are designed for a specific target implementation, such as a processor (CPU) cache. In order to meet the design goals of the present inventive principles, other cache protocol factors, rather than only those embodied by the MSI protocol, must be considered.

Other cache protocol factors to consider are:

1) Read/Write ordering consistency

2) Allocate on Write Policy

3) Write-through, Write-Back, and Non-cacheable attributes

4) Blocking vs. a Non-Blocking design

5) Support for hardware codecs

6) Squashing support to save I/O Requests

Another important item to consider when applying this concept to the invention's cache protocol is the high latency associated with issuing and completing disk I/Os. It is necessary to break apart the MSI Shared and Modified states to take into consideration the following cases:

1) A cache line is allocated, but the disk I/O may not complete for hundreds if not thousands of microseconds. During this time, additional I/O requests could be made against the same allocated cache line.

2) Dynamically changing cache policies based on file-stream attributes, in different process contexts.

3) Take maximum advantage of the Asynchronous I/O model.

Application of these considerations is shown in the state diagram of FIG. 4B, which shows state transitions for write-invalidation in accord with the present inventive design principles.

Many operating systems have features that can be exploited for maximum performance benefit. As previously mentioned, some of these features are asynchronous I/O models, I/O Request Packets or IRPs that can be pended, managed and queued by intermediate drivers, and internal list manipulation techniques, such as look-aside lists or buddy lists. These features may vary slightly from operating system to operating system; none of these features are precluded or required by the present inventive design principles.

Refer now to FIG. 5, which illustrates, in flow chart form, an adaptive, transparent compression caching methodology 200 in accordance with the present inventive principles. In the logical view of FIG. 2C, methodology 200 may be primarily performed by filter driver 118, or alternatively, may reside logically between filter driver 118 and ZCache driver 120.

Methodology 200 watches for I/O operations involving data block moves, in step 502 (see FIG. 5). As illustrated in FIG. 6, a data block move may be detected by “peeking” at, or disassembling, the I/O request packets that control the handling of I/O operations. If an I/O operation involving data block moves is detected, methodology 200 performs operations to determine if the subject data is to be cached. This is described in conjunction with step 204 of FIG. 6 and steps 204-214 of FIG. 7A. In general, caching decisions are based on user-selectable caching policies in combination with caching “instructions” that may be set by the application making the data transfer request. Step 204 instructs the operating system how an I/O operation should be handled. In particular, each I/O packet includes descriptive data that may include information (i.e., “flags”) for controlling the caching of the data transported in the packet.

Firstly, the user may specify a list of files to be ignored. If, in step 204, the subject file of the data move is in the “ignored” list, process 200 returns to step 208 to continue to watch for data block moves. Otherwise, in step 206, it is determined if caching is turned off in accordance with a global caching policy. As discussed in conjunction with FIG. 2C, a file-tracking database 122 (equivalently, a file tracking “registry”) may be maintained in accordance with caching architecture 100. This registry may include a set of file tracking flags 20, FIG. 2E. In an embodiment of file tracking flags 20, each entry may be a hexadecimal (hex) digit. GlobalPolicy flag 21, which may be set by the user, determines the global policy, which establishes the most aggressive caching policy for any file. In other words, as described further below, other parameters may override the global policy to reduce the aggressiveness for a particular file. GlobalPolicy flag 21 may take predetermined values (e.g., a predetermined hex digit) representing respective ones of a writeback policy, a writethrough policy, and no caching. Writeback caching means that a given I/O write request may be inserted in the ZCache instead of immediately writing the data to the persistent store. Writethrough caching means that the data is also immediately written to the persistent store. If, in step 206, caching is turned off, such as if GlobalPolicy flag 21 is set to a predetermined hex value representing “no cache,” process 200 passes the I/O request to the operating system (OS) for handling, step 208. Otherwise, process 200 proceeds to step 210.

In decision block 210, it is determined if dynamic, manual, or alternatively, non-conservative tracking is set. This may be responsive to a value of Dynamic flag 28, FIG. 2E. In an embodiment of the present invention, if the value of the flag is “writethrough,” dynamic tracking is enabled, and if the value of the flag is “no cache,” manual tracking is enabled. (Manual tracking allows the user to explicitly list in the file tracking database which files are to be cached.) In dynamic mode, if, in step 212, the subject file is a tracked file, it is cached in the ZCache in accordance with cache policy (either as writethrough or writeback). File flags associated with the subject file are ignored in manual mode and honored in dynamic mode. In particular, in a Windows NT environment, a FO_NO_INTERMEDIATE_BUFFERING flag is ignored in manual mode (and honored in dynamic mode), and likewise an analogous flag in other OS environments. If the subject file is an untracked file, process 200 proceeds to step 214.

Untracked files include metadata and files that may have been opened before the caching process started. Metadata files are files that contain descriptions of data, such as information concerning the location of files and directories, log files to recover corrupt volumes, and flags which indicate bad clusters on a physical disk. Metadata can represent a significant portion of the I/O to a physical persistent store because the contents of small files (e.g., <4,096 bytes) may be completely stored in metadata files. In step 214 it is determined if non-conservative caching is enabled. In an embodiment of the present invention using file tracking flags 20, FIG. 2E, step 214 may be performed by examining Default flag 24. If the value of Default flag 24 is the hex digit representing “writeback,” then non-conservative caching is enabled, and decision block 214 proceeds by the “Y” branch. Conversely, if the value of Default flag 24 is the hex digit representing “no cache,” then non-conservative caching is disabled, decision block 214 proceeds by the “N” branch, and the respective file operation is passed to the OS for handling (step 208).

It is also determined if the subject file is a pagefile and, if so, whether caching of pagefiles is enabled; flag 28 (FIG. 2E) has a value representing pagefile I/O. Pagefile I/O is passed to the OS for handling.
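
The decision sequence above can be condensed into the following illustrative sketch. The flag names follow the description, but their concrete hex values are not given, so symbolic strings are used, the ordering of the pagefile and non-conservative checks is an approximation, and all function and parameter names are hypothetical.

    def should_cache(file_name, request_is_pagefile, flags, ignored, tracked):
        if file_name in ignored:
            return False                           # user "ignore" list: pass to the OS
        if flags["GlobalPolicy"] == "nocache":
            return False                           # caching globally turned off
        if flags["Dynamic"] == "manual":
            return file_name in tracked            # only explicitly listed files
        if file_name in tracked:
            return True                            # dynamic mode, tracked file
        if request_is_pagefile:
            return False                           # pagefile I/O passed to the OS
        return flags["Default"] == "writeback"     # non-conservative caching of untracked files

    flags = {"GlobalPolicy": "writeback", "Dynamic": "dynamic", "Default": "nocache"}
    print(should_cache("data.db", False, flags, ignored=set(), tracked={"data.db"}))  # True
    print(should_cache("pagefile.sys", True, flags, ignored=set(), tracked=set()))    # False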

Process 200, having determined that the subject data is to be cached, extracts file object information from the I/O request packet in step 220 and stores it in the file tracking DB in step 222 (FIG. 6). Such data may include any policy flags set by the application issuing the subject I/O request. If, for example, in a Windows NT environment, the FO_WRITE_THROUGH flag is set in the packet descriptor, the WRITE_THROUGH flag 28, FIG. 2E, may be set in step 222. Similarly, if FO_NO_INTERMEDIATE_BUFFERING is set in the I/O request packet, then the NO_BUFF flag 28 may be set in step 222. Additionally, sequential file access flags, for example, may also be stored.

In FIG. 7B, if the I/O request is a write, process 200 proceeds by the “Y” branch of step 224 to step 226. If the request is not a write request, decision block 224 proceeds by the “No” branch to decision block 228, to determine if the request is a read.

In step 226 (FIG. 7C), storage space in the ZCache is reserved, and in step 230, a miss counter associated with the subject data block to be cached is cleared. Each such block may have a corresponding tag that represents a fixed-size block of data. For example, a block size of 4,096 bytes, which is normally equivalent to the PAGE SIZE of a computer processor that would execute the instructions for carrying out the method of the present invention, may be used in an embodiment of the present invention; however, other block sizes may be used in accordance with the present inventive principles. FIG. 7E schematically illustrates a block tag 300 which may be stored in the file tracking database. Block tag 300 may be viewed as a data structure having a plurality of members, including counter member 302, which includes miss counter 304. Counter member 302 may, in an embodiment of the present invention, be one byte wide, and miss counter 304 may be one bit wide (“true/false”). The operation of the miss counter will be discussed further herein below.
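
The block tag of FIG. 7E might be represented as in the following sketch. The field names mirror the members called out in this description (302 through 318, some of which are introduced in later paragraphs); the concrete types and any binary layout are assumptions made only for illustration.

    from dataclasses import dataclass

    @dataclass
    class BlockTag:
        miss: bool = False              # member 304: one bit of the counter byte 302
        prefetched: bool = False        # member 306: one bit of the counter byte 302
        entropy_estimate: int = 0       # member 310: signed estimate, roughly -50..+50
        compressor_type: int = 0        # member 312: which compressor was selected
        compression_ratio: float = 1.0  # member 314: compression ratio attained
        access_count: int = 0           # member 316: number of accesses
        gap_prediction: int = 0         # member 318: predicted LBN skip, 0 = none

    tag = BlockTag()
    tag.access_count += 1               # updated on each reference to the 4,096-byte block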

In step 232 (FIG. 7C), a compression estimation is made. The amount of compression that may be achieved on any particular block is determined by the degree of redundancy in the data block, in accordance with Shannon's classic information theory. A block of data that is perfectly random has maximum entropy in this picture and does not compress. An estimation of the entropy of the subject block may be used as a measure of the maximum compression that may be achieved for that block. Different data compression techniques are known in the art, and the “better” the compressor, the closer the compression ratio achieved will be to the entropy-theoretic value. However, the greater compression comes at the price of computational complexity, or, equivalently, CPU cycles. Thus, although memory may be saved by the higher compression ratios, the savings may come at the price of reduced responsiveness because of the added CPU burden. In other words, different compression schemes may be employed to trade off space and time. In an embodiment of the present invention, an entropy estimate may be made using a frequency table for the data representation used. Such frequency tables are used in the cryptographic arts and represent the statistical properties of the data. For example, for ASCII data, a 256-entry relative frequency table may be used. Frequency tables are often used in cryptography and compression; they are pre-built tables used for predicting the probability frequency of presumed alphabetic token occurrences in a data stream. In this embodiment, the token stream is presumed to be ASCII-encoded tokens, but is not restricted to this. For computational convenience, the entropy may be returned as a signed integer value in the range ±50. A maximal-entropy block would return the value 50. The entropy estimate may also be stored in the block tag (tag member 310, FIG. 7E). The value of the entropy estimate may be used to select a compressor, step 234, or may also be used to provide a bias to pre-fetching for previously seen read data blocks.
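
A minimal sketch of an entropy estimator consistent with this description follows: a 256-entry frequency table over the block's bytes, scaled to a signed integer where +50 denotes a maximally random (incompressible) block. The exact mapping onto the ±50 range is an assumption the description does not spell out, and a measured frequency table is used here in place of a pre-built one.

    import math
    from collections import Counter

    def estimate_entropy(block):
        counts = Counter(block)                  # 256-entry relative frequency table
        n = len(block)
        bits_per_byte = -sum((c / n) * math.log2(c / n) for c in counts.values())
        # map 0..8 bits/byte onto the signed -50..+50 scale (assumed mapping)
        return round(bits_per_byte / 8.0 * 100.0) - 50

    print(estimate_entropy(bytes(4096)))             # -50: all zeros, trivially compressible
    print(estimate_entropy(bytes(range(256)) * 16))  # +50: every byte value equally likely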

In step 234, which may be viewed as a three-way decision block if three levels of compression are provided, the subject data block is compressed using an entropy-estimate-based compressor selection. This may be further understood by referring to FIG. 7F, which schematically depicts a set of entropy bands about the maximum-entropy watermark (which may correspond to a value of zero for a random block), the bands having pre-selected relative widths about the maximum-entropy watermark. Thus, bands 402a and 402b are shown with a width of 6%, and represent a block that deviates by a relatively small amount from a random block and would be expected to benefit little from compression. Therefore, in step 234, FIG. 7G, zero compression 236 may be selected. In other words, such a block may be cached without compression. If the entropy estimate returns a value in bands 404a, 404b, shown with a width of 19%, a zero-bit compressor 238, FIG. 2C, may be selected. A zero-bit compressor counts the number of zeros occurring before a one occurs in the word; the zeros are replaced by the value representing the number of zeros. If the entropy estimate returns a value in bands 406a, 406b, having an illustrated width of 25%, a more sophisticated compression may be used, as the degree of compression expected may warrant the additional CPU cycles that such a compressor would consume. In step 234, a compressor of the Lempel-Ziv (LZ) type 240 may be selected. LZ-type compressors are based on the concept, described by J. Ziv and A. Lempel in 1977, of parsing strings from a finite alphabet into substrings of different lengths (not greater than a predetermined maximum) and a coding scheme that maps the substrings into code words of fixed, also predetermined, length. The substrings are selected so they have about equal probability of occurrence. Algorithms for implementing LZ-type compression are known in the art, for example, the LZW algorithm described in U.S. Pat. No. 4,558,302 issued Dec. 10, 1985 to Welch and the LZO compressors of Markus F. X. J. Oberhumer, http://www.oberhumer.com/, which are incorporated herein by reference. The type of compressor used and the compression ratio attained may be stored in the block tag (FIG. 7E, members 312 and 314, respectively). Bands may be added for other compressor types known in the art, such as Burrows-Wheeler (BWT) or PPM (Prediction by Partial Match).
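
The band-based selection might look like the following sketch. The band widths (6%, 19% and 25% on either side of the maximum-entropy watermark) are taken from the description; the simple zero-run coder and the use of zlib as the LZ-type compressor are illustrative stand-ins for the actual codecs, and reading the band widths as distances on the ±50 entropy scale is an assumption.

    import zlib

    def zero_bit_compress(data):
        """Replace runs of zero bytes with (0x00, run length); other bytes pass through."""
        out, i = bytearray(), 0
        while i < len(data):
            if data[i] == 0:
                run = 1
                while i + run < len(data) and data[i + run] == 0 and run < 255:
                    run += 1
                out += bytes((0, run))
                i += run
            else:
                out.append(data[i])
                i += 1
        return bytes(out)

    def select_compressor(entropy):
        """entropy: signed -50..+50 estimate; +50 is the maximum-entropy watermark."""
        deviation = 50 - entropy            # distance below the watermark
        if deviation <= 6:                  # bands 402a/402b: nearly random, don't compress
            return None
        if deviation <= 6 + 19:             # bands 404a/404b: cheap zero-bit compressor
            return zero_bit_compress
        return zlib.compress                # bands 406a/406b: LZ-type compressor

    block = b"\x00" * 4000 + bytes(range(96))
    compressor = select_compressor(-30)     # low-entropy block: falls in the LZ band
    packed = compressor(block) if compressor else block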

Moreover, the bands may be adaptively adjusted. If, for example, the CPU is being underutilized, it may be advantageous to use a more aggressive compressor, even if the additional compression might not otherwise be worth the tradeoff. In this circumstance, the width of bands 404a, 404b and 406a, 406b may be expanded. Conversely, if CPU cycles are at a premium relative to memory, it may be advantageous to increase the width of bands 402a, 402b and shrink the width of bands 406a, 406b. A methodology for adapting the compressor selection is described in conjunction with FIG. 7F.
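
One possible form of that adaptation, assuming a periodic measurement of CPU utilization, is sketched below; the utilization thresholds and step size are illustrative assumptions.

    def adjust_bands(bands, cpu_utilization):
        """bands maps 'none' / 'zero_bit' / 'lz' to widths in percent of the entropy scale."""
        step = 1
        if cpu_utilization < 0.30 and bands["none"] > step:
            bands["none"] -= step          # CPU headroom: compress more blocks
            bands["lz"] += step
        elif cpu_utilization > 0.80 and bands["lz"] > step:
            bands["lz"] -= step            # CPU is the bottleneck: compress fewer blocks
            bands["none"] += step
        return bands

    bands = {"none": 6, "zero_bit": 19, "lz": 25}
    adjust_bands(bands, cpu_utilization=0.15)   # -> {'none': 5, 'zero_bit': 19, 'lz': 26}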

In FIGS. 8A-8J, the data is cached, and any unused reserved space is freed. It is determined if the cached data block previously existed on the persistent store (e.g., disk). If not, an I/O packet of equal size to the uncompressed data block is issued to the persistent store. In this way, the persistent store reserves the space for a subsequent flush, which may also occur if the OS crashes. Additionally, if a read comes in, the block will be returned without waiting for the I/O packet request to complete, in accordance with the writeback mechanism. If the block previously existed on the persistent store, or if the cache policy for the block is writethrough (overriding the writeback default), the block is written to the persistent store. Otherwise, the block is scheduled for a flush. Additionally, “write squashing” may be implemented whereby flushes coming through from the OS are suppressed. In this way, process 200 may lay down contiguous blocks at one time, to avoid fragmenting the persistent store. Process 200 then returns to step 208.
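
The flush scheduling and write-squashing behavior might be sketched as follows. The data structures and the write_run callable are illustrative; a real driver would coordinate with the OS rather than use a simple in-memory table.

    class FlushScheduler:
        def __init__(self, write_run):
            self.dirty = {}              # block number -> data awaiting write-back
            self.write_run = write_run   # callable(start_block, [data, ...])

        def mark_dirty(self, block_no, data):
            self.dirty[block_no] = data

        def os_flush_request(self, block_no):
            return False                 # "write squashing": flush on our own schedule

        def scheduled_flush(self):
            run_start, run = None, []
            for block_no in sorted(self.dirty):
                if run and block_no != run_start + len(run):
                    self.write_run(run_start, run)     # emit the previous contiguous run
                    run_start, run = None, []
                if not run:
                    run_start = block_no
                run.append(self.dirty[block_no])
            if run:
                self.write_run(run_start, run)
            self.dirty.clear()

    sched = FlushScheduler(write_run=lambda start, blocks: print(start, len(blocks)))
    for n in (10, 11, 12, 40):
        sched.mark_dirty(n, b"\x00" * 4096)
    sched.scheduled_flush()              # writes the contiguous run 10-12, then block 40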

Returning to step 228 in FIG. 7B, if the request is a read request, then in FIG. 7E the prefetch and miss counters of the subject block are reset, and the reference counters for all blocks are updated. A methodology for updating the reference counter for a block will be described in conjunction with FIG. 7D, below. In step 258 (FIG. 7E), it is determined if the block has been previously read. This may be determined by a non-zero access count in number-of-accesses member 316, FIG. 7E.

If the block has been previously read, in step 260 it is determined if a gap prediction is stored in the tag (e.g., gap prediction member 318, FIG. 7E). Gap prediction is accomplished by testing the distance in logical block numbers (LBNs) from one read request on a file to a subsequent read request on the same file, where the LBNs are not adjacent (that is, each read does not take place at the next higher or lower LBN associated with this file) but there is a regular skip pattern (e.g., a read is done, some regular number of LBNs is skipped, either positively or negatively, and a subsequent read is issued at this skipped distance) that has been detected from at least two previous reads of this file. If gap prediction has been detected, then prefetching will continue as if normal sequential access had been detected, to the length of the gap. If so, in step 260 it is determined if a reference counter in the next block in the sequence is smaller than two. If a block that has been prefetched is not hit in the next two references, then it will not be prefetched again, unless its entropy estimation is approximately equal, plus or minus 2% (this value is arrived at empirically and may be different for different operating systems or platforms), to the entropy of the previously fetched block, and process 200 bypasses step 264.

Otherwise, in step 264 the next sequential block is prefetched and a prefetch counter is set for the block. Referring to FIG. 7E, counter member 302 may, in an embodiment of the present invention, be one byte wide, and may contain a prefetch counter 306 which may be one bit wide (“true/false”).
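
A minimal sketch of the gap prediction just described: remember recent read positions per file and, when two consecutive gaps show the same non-zero stride, predict the next logical block at that stride; a helper shows the ±2% entropy tolerance used to re-enable prefetching. Requiring two equal consecutive gaps is one reading of "detected from at least two previous reads"; all names are illustrative.

    def predict_gap(history):
        """history: LBNs of reads on one file, oldest first. Returns next LBN or None."""
        if len(history) < 3:
            return None
        g1 = history[-2] - history[-3]
        g2 = history[-1] - history[-2]
        if g1 == g2 and g1 != 0:
            return history[-1] + g2        # regular skip pattern detected
        return None

    def entropy_close(e1, e2, scale=100):
        """True when two -50..+50 entropy estimates differ by no more than 2% of the scale."""
        return abs(e1 - e2) <= 0.02 * scale

    print(predict_gap([100, 108, 116]))    # 124: stride of 8 LBNs seen twice
    print(predict_gap([100, 101, 110]))    # None: no regular skip pattern
    print(entropy_close(12, 13))           # True: within the 2% bias tolerance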

Returning to step 258, if the block has not been previously read, in FIG. 7F an entropy estimate is made for the block (using the same technique as in step 232) and is stored in the file tracking database (e.g., in compression estimate member 310, FIG. 7E). A next block is then selected for prefetching based on entropy and distance (FIG. 7E). That is, of the blocks nearest in entropy (once again, within 2%), the closest block in distance to the subject block of the read request is prefetched. (Recall that a block has a unique entropy value, but a given entropy value may map into a multiplicity of blocks.) If, however, in FIG. 7E the miss counter for the selected block is set, prefetching of that block is bypassed (the “Y” branch of the decision block). Otherwise, in step 274, the block is prefetched, and the miss counter (e.g., miss counter 304, FIG. 7E) for the prefetched block is set (or logically “True”). The prefetch counter is set in step 266, as before.

Similarly, if there is no gap prediction, a prefetch based solely on entropy is performed via the “No” branch of decision block 260.
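
The entropy-and-distance choice for a first-time read might look like the following sketch: among candidate blocks whose stored entropy estimate lies within 2% of the current block's, pick the nearest by LBN distance, skipping any block whose miss bit is still set. The tag-dictionary layout is an assumption for illustration.

    def pick_prefetch(current_lbn, current_entropy, tags):
        """tags: LBN -> {'entropy': int (-50..50), 'miss': bool}"""
        candidates = [
            lbn for lbn, t in tags.items()
            if lbn != current_lbn
            and abs(t["entropy"] - current_entropy) <= 2   # 2% of the 100-wide scale
            and not t["miss"]
        ]
        return min(candidates, key=lambda lbn: abs(lbn - current_lbn), default=None)

    tags = {
        200: {"entropy": 10, "miss": False},
        205: {"entropy": 11, "miss": False},
        207: {"entropy": 11, "miss": True},    # previously prefetched but never hit
    }
    print(pick_prefetch(204, 10, tags))        # 205: nearest block with matching entropy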

In step 204 the read is returned.

FIG. 9 illustrates an exemplary hardware configuration of data processing system 700 in accordance with the subject invention. The system, in conjunction with the methodologies illustrated in FIG. 5 and architecture 100, FIG. 2C, may be used for data caching in accordance with the present inventive principles. Data processing system 700 includes central processing unit (CPU) 710, such as a conventional microprocessor, and a number of other units interconnected via system bus 712. Data processing system 700 may also include random access memory (RAM) 714, read only memory (ROM) (not shown) and input/output (I/O) adapter 722 for connecting peripheral devices such as disk units 720 to bus 712. System 700 may also include a communication adapter for connecting data processing system 700 to a data processing network, enabling the system to communicate with other systems. CPU 710 may include other circuitry not shown herein, which will include circuitry commonly found within a microprocessor, e.g., execution units, bus interface units, arithmetic logic units, etc. CPU 710 may also reside on a single integrated circuit.

Preferred implementations of the invention include implementations as a computer system programmed to execute the method or methods described herein, and as a computer program product. According to the computer system implementation, sets of instructions for executing the method or methods are resident in the random access memory 714 of one or more computer systems configured generally as described above. These sets of instructions, in conjunction with the system components that execute them, may perform operations in conjunction with data block caching as described hereinabove. Until required by the computer system, the set of instructions may be stored as a computer program product in another computer memory, for example, in disk drive 720 (which may include a removable memory such as an optical disk or floppy disk for eventual use in the disk drive 720). Further, the computer program product can also be stored at another computer and transmitted to the user's workstation by a network or by an external network such as the Internet. One skilled in the art would appreciate that the physical storage of the sets of instructions physically changes the medium upon which it is stored so that the medium carries computer-readable information. The change may be electrical, magnetic, chemical, biological, or some other physical change. While it is convenient to describe the invention in terms of instructions, symbols, characters, or the like, the reader should remember that all of these and similar terms should be associated with the appropriate physical elements.

Note that the invention may be described in terms such as comparing, validating, selecting, identifying, or other terms that could be associated with a human operator. However, for at least a number of the operations described herein which form part of at least one of the embodiments, no action by a human operator is desirable. The operations described are, in large part, machine operations processing electrical signals to generate other electrical signals.

CLAIMS

1. A method for caching data comprising: detecting a data input/output (I/O) request, relative to a data object; selecting appropriate I/O to cache, wherein said selecting can occur with or without user input, or with or without application or operating system preknowledge; estimating an entropy of a data block to be cached in response to the data input/output request; selecting a compressor using a value of the entropy of the data block from the estimating step, wherein each compressor corresponds to one of a plurality of ranges of entropy values relative to an entropy watermark; storing the data block in a cache in compressed form from the selected compressor, or in uncompressed form if the value of the entropy of the data block from the estimating step falls in a first range of entropy values relative to the entropy watermark; and prefetching the data block using gap prediction with an applied entropy bias.

2. The method of claim 1 further comprising adaptively adjusting the plurality of ranges of entropy values.

3. The method of claim 1 further comprising scheduling a flush of the data block from the cache.

4. The method of claim 3 further comprising suppressing operating system flushes in conjunction with the scheduling step.

5. The method of claim 1, wherein said selecting occurs dynamically.

6. The method of claim 1, wherein said selecting occurs manually.

7. A method for caching data comprising: detecting a data input/output (I/O) request, relative to a data object; selecting appropriate I/O to cache, wherein said selecting can occur with or without user input, or with or without application or operating system preknowledge; estimating an entropy of a first data block to be cached in response to the data input/output request; selecting a compressor using a value of the entropy of the first data block from the estimating step, wherein each compressor corresponds to one of a plurality of ranges of entropy values relative to an entropy watermark; storing the first data block in a cache in compressed form from the selected compressor, or in uncompressed form if the value of the entropy of the first data block from the estimating step falls in a first range of entropy values relative to the entropy watermark; and prefetching a second data block using gap prediction with an applied entropy bias.

8. The method of claim 7 further comprising adaptively adjusting the plurality of ranges of entropy values.

9. The method of claim 7 further comprising scheduling a flush of the data block from the cache.

10. The method of claim 9 further comprising suppressing operating system flushes in conjunction with the scheduling step.

11. The method of claim 7, wherein said selecting occurs dynamically.

12. The method of claim 7, wherein said selecting occurs manually.

13. One or more computer program products readable by a machine and containing instructions for performing the method contained in claim 1.

14. One or more computer program products readable by a machine and containing instructions for performing the method contained in claim 7.